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Abstract 

Since being analyzed by Rokhlin, Szlam, and Tygert Ul and popularized by 
Halko, Martinsson, and Tropp randomized Simultaneous Power Iteration has 
become the method of choice for approximate singular value decomposition. It is 
more accurate than simpler sketching algorithms, yet still converges quickly for 
any matrix, independently of singular value gaps. After 0(l/e) iterations, it gives 
a low-rank approximation within (1 -f e) of optimal for spectral norm error. 

We give the first provable runtime improvement on Simultaneous Iteration: a sim¬ 
ple randomized block Krylov method, closely related to the classic Block Lanczos 
algorithm, gives the same guarantees in just 0(l/i/e) iterations and performs sub¬ 
stantially better experimentally. Despite their long history, our analysis is the first 
of a Krylov subspace method that does not depend on singular value gaps, which 
are unreliable in practice. 

Eurthermore, while it is a simple accuracy benchmark, even (1 + e) error for spec¬ 
tral norm low-rank approximation does not imply that an algorithm returns high 
quality principal components, a major issue for data applications. We address this 
problem for the first time by showing that both Block Krylov Iteration and a minor 
modification of Simultaneous Iteration give nearly optimal PCA for any matrix. 
This result further justifies their strength over non-iterative sketching methods. 
Einally, we give insight beyond the worst case, justifying why both algorithms can 
run much faster in practice than predicted. We clarify how simple techniques can 
take advantage of common matrix properties to significantly improve runtime. 


1 Introduction 

Any matrix A G with rank r can be written using a singular value decomposition (SVD) as 

A = USV"^. U G and V G R'^^’’ have orthonormal columns (A’s left and right singular 

vectors) and S G R’’^’" is a positive diagonal matrix containing A’s singular values: cti > ... > ar- 
A rank k partial SVD algorithm returns just the top k left or right singular vectors of A. These are 
the first k columns of U or V, denoted and Vfc respectively. 

Among countless applications, the SVD is used for optimal low-rank approximation and principal 
component analysis (PCAfl Specifically, for fc < r, a partial SVD can be used to construct a rank k 
approximation A^ such that both || A — A^ || p and || A — A^ || 2 are as small as possible. We simply 
set Afc = UfeU^A. That is, Afc is A projected onto the space spanned by its top k singular vectors. 

Eor principal component analysis, A’s top singular vector Ui provides a top principal component, 
which describes the direction of greatest variance within A. The z* singular vector provides the 

'Typically after mean centering A’s columns or rows, depending on which principal components we want. 
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principal component, which is the direction of greatest variance orthogonal to all higher principal 
components. Formally, denoting A’s i* singular value as cr^, 

uf AA^Ui = al = max x^AA^x. 

x:||x||2 = l, x_LujVj<z 

Traditional SVD algorithms are expensive, typically running in 0{n(P) tim^B Hence, there has 
been substantial research on randomized techniques that seek nearly optimal low-rank approxima¬ 
tion and PCA E] E] [U12 ID. These methods are quickly becoming standard tools in practice and 
implementations are widely available Eiiiiiiiini, including in popular learning libraries like scikit- 
learn ifTTl . 

Recent work focuses on algorithms whose runtimes do not depend on properties of A. In contrast, 
classical literature typically gives runtime bounds that depend on the gaps between A’s singular 
values and become useless when these gaps are small (which is often the case in practice - see 
Section[2i. This limitation is due to a focus on how quickly approximate singular vectors converge 
to the actual singular vectors of A. When two singular vectors have nearly identical values they are 
difficult to distinguish, so convergence inherently depends on singular value gaps. 

Only recently has a shift in approximation goal, along with an improved understanding of random¬ 
ization, allowed for algorithms that avoid gap dependence and thus run provably fast for any matrix. 
For low-rank approximation and PCA, we only need to find a subspace that captures nearly as much 
variance as A’s top singular vectors - distinguishing between two close singular values is overkill. 

1.1 Prior Work 

The fastest randomized SVD algorithms llHlhl run in 0(nnz(A)) tim^B are based on non-iterative 
sketching methods, and return a rank k matrix Z with orthonormal columns Zi,..., satisfying 

Frobenius Norm Error; || A — ZZ^A||f < (1-f e)|| A — Afc||i^’. (1) 

Unfortunately, as emphasized in prior work IT]|2112El, Frobenius norm error is often hopelessly 
insufficient, especially for data analysis and learning applications. When A has a “heavy-tail” of 
singular values, which is common for noisy data, || A — A^ |||. = J2i>k huge, potentially 

much larger than A’s top singular value. This renders ([T]l meaningless since Z does not need to 
align with any large singular vectors to obtain good multiplicative error. 

To address this shortcoming, a number of papers ElEElEl suggest targeting spectral norm 
low-rank approximation error. 

Spectral Norm Error: || A — ZZ^A ||2 < (1 + e)|| A — Afc|| 2 , (2) 

which is intuitively stronger. When looking for a rank k approximation, A’s top k singular vectors 
are often considered data and the remaining tail is considered noise. A spectral norm guarantee 
roughly ensures that ZZ^A recovers A up to this noise threshold. 

A series of work EI2E1II61E1 shows that decades old Simultaneous Power Iteration (also called 
subspace iteration or orthogonal iteration) implemented with random start vectors, achieves (|2 after 
0(l/e) iterations. Hence, this method, which was popularized by Halko, Martinsson, and Tropp in 
El, has become the randomized SVD algorithm of choice for practitioners lElEl- 

2 Our Results 

2.1 Faster Algorithm 

We show that Algorithm |2 a randomized relative of the Block Lanczos algorithm ifTSl [T^ . which 
we call Block Krylov Iteration, gives the same guarantees as Simultaneous Iteration (Algorithm (TJ 

^This is somewhat of an oversimplicifcation. By the Abel-Ruffini Theorem, an exact SVD is incomputable 
even with exact arithmetic H. Accordingly, all SVD algorithm are inherently iteratively. Nevertheless, tra¬ 
ditional methods including the ubiquitous QR algorithm obtain superlinear convergence rates for the low-rank 
approximation problem. In any reasonable computing environment, they can be taken to run in 0{nd^) time. 
^Here nnz(A) is the number of non-zero entries in A and this runtime hides lower order terms. 
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in just 0(l/v^) iterations. This not only gives the fastest known theoretical runtime for achieving 
(|2|i, but also yields substantially better performance in practice (see Section[8]l. 

Even though the algorithm has been discussed and tested for potential improvement over Simulta¬ 
neous Iteration lUlllOllII], theoretical bounds for Krylov subspace and Lanczos methods are much 
more limited. As highlighted in IfT^ . 

“Despite decades of research on Lanczos methods, the theory for [randomized 
power iteration] is more complete and provides strong guarantees of excellent 
accuracy, whether or not there exist any gaps between the singular values.” 

Our work addresses this issue, giving the first gap independent bound for a Krylov subspace method. 


Algorithm 1 Simultaneous Iteration 

input: A € error e € (0,1), rank fc <n,d 

output: Z G 

1: q ■= 0(1^), n - A/'(0, 

2: K := (AA'^)'^ An 

3: Orthonormalize the columns of K to obtain 

Q G 

4: Compute M := Q^AA^Q G 

5: Set Ufc to the top k singular vectors of M. 

6: return Z = QUk 


Algorithm 2 Block Krylov Iteration 

input: A G error e G (0, l),rankA: <n,d 

output: Z G R"'"'' 

1: q := 0(i^), H - A/'(0,1)'^'^'= 

2: K := [An, (AA^)An,...,(AA'^)«An] 

3: Orthonormalize the columns of K to obtain 

Q g 

4: Compute M := Q'^AA'^Q G R'?'=X'?'=. 

5: Set Ufe to the top k singular vectors of M. 

6: return Z = QUk- 


2.2 Stronger Guarantees 

In addition to runtime improvements, we target a much stronger notion of approximate SVD that is 
needed for many applications, but for which no gap-independent analysis was known. 

Specifically, as noted in ll2^ . while intuitively stronger than Frobenius norm error, (1 -f e) spec¬ 
tral norm low-rank approximation error does not guarantee any accuracy in Z for many matricefl 
Consider A with its top fc -I- 1 squared singular values all equal to 10 followed by a tail of smaller 
singular values (e.g. lOOOfc at 1). ||A —Afc||| = 10 but in fact || A — ZZ^A||| = 10 for any rank 
k Z, leaving the spectral norm bound useless. At the same time, || A — Afe|||- is large, so Frobenius 
error is meaningless as well. For example, any Z obtains || A — ZZ^A|||. < (1.01)||A-A,||^. 

With this scenario in mind, it is unsurprising that low-rank approximation guarantees fail as an 
accuracy measure in practice. We ran a standard sketch-and-solve approximate SVD algorithm 
(see Section ini i on SNAP/amazon 0302, an Amazon product co-purchasing dataset Il23ll2^ . and 
achieved very good low-rank approximation error in both norms for k = 30: 

||A-ZZ^A||f < 1.001||A-Afellf and || A - ZZ^Ajja < 1.038||A - Afc|| 2 . 

However, the approximate principal components given by Z are of significantly lower quality than 
A’s true singular vectors (see Figure[Tll. We saw a similar phenomenon for the popular 20 NEWS- 
GROUPS dataset ll25l and several others. Additionally, the potential failure of low rank approxima¬ 
tion measures was recently raised in Il22ll . 

We address this issue by introducing a per vector guarantee that requires each approximate singular 
vector Zi,..., Zfc to capture nearly as much variance as the corresponding true singular vector: 

Per Vector Error: Vi, |uf AA^u^ — zf AA^z^j < e(T^_i_ 2 . (3) 

The error bound Q is very strong in that it depends on meaning that it is better then relative 

error, i.e. |uf AA^u^ — zf AA^z^ | < ecrf, for A’s large singular vectors. While it is reminiscent 
of the bounds sought in classical numerical analysis ||26l, we stress that it does not require each z^ to 
converge to in the presence of small singular value gaps. In fact, we show that both randomized 


"'in fact, it does not even imply (1 + e) Frobenius norm error. 
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Figure 1: Poor per vector error Q for SNAP/amazon 0302 returned by a sketch-and-solve ap¬ 
proximate SVD that gives very good low-rank approximation in both spectral and Frobenius norm. 


Block Krylov Iteration and our slightly modified Simultaneous Iteration algorithrr(3 achieve (O in 
gap-independent runtimes. 

2.3 Main Result 

Our contributions are summarized in Theorem[Tl whose proof appears in parts as Theorems |6] and |7] 
in Section|5](runtime) and Theorems [Tol fTTl and[T2]in Section|6](accuracy). 

Theorem 1 (Main Theorem). With high probability, Algorithms\I\and^find approximate singular 
vectors Z = [zi,, Zfc] satisfying guarantees ([T]) and ©/or low-rank approximation and ©/or 
PCA. For error e, Algorithm\J\requires q = 0{\ogd/e) iterations while Algorithm^requires q = 
0(log djy/e) iterations. Excluding lower order terms, both algorithms run in time 0(nnz(A)fcq). 

We note that, while Simultaneous Iteration was known to achieve 0 ins, surprisingly we are first 
to prove that it gives O, a qualitatively weaker goal. 

In Section |7] we use our results to give an alternative analysis of both algorithms that does depend 
on singular value gaps and can offer significantly faster convergence when A has decaying singular 
values. It is possible to take further advantage of this result by running Algorithms[T]and|2]with a 11 
that has > k columns, a simple modification for accelerating either method. 

Finally, Section [8] contains a number of experiments on large data problems. We justify the im¬ 
portance of gap independent bounds for predicting algorithm convergence and we show that Block 
Krylov Iteration in fact significantly outperforms the more popular Simultaneous Iteration. 

2.4 Comparison to Classical Bounds 

Decades of work has produced a variety of gap dependent bounds for power iteration and Krylov 
subspace methods. We refer the reader to Saad’s standard reference Ell. Most relevant to our 
work are bounds for block Krylov methods with block size equal to k 1281 . Roughly speaking, with 
randomized initialization, these results offer guarantees equivalent to our strong equation © for the 
top k singular directions after: 

O { toations. 

This bound is recovered by our Section |7] results and, when the target accuracy e is smaller than the 
relative singular value gap (crfe/cTfc+i — 1), it is tighter than our gap independent results. However, 
as discussed in Section [8l for high dimensional data problems where e is set far above machine 
precision, gap independent bounds more accurately predict required iteration count. 

^For guarantee © it is important that Algorithm [T] includes post-processing steps 4 and 5 rather than just 
returning a basis for K, which is sufficient for the low-rank approximation guarantees. 
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Less comparable to our results are attempts to analyze algorithms with block size smaller than k 
1261. While “small block” or single vector algorithms offer runtime advantages, it is well understood 
that with b duplicate singular values, it is impossible to recover the top k singular directions with a 
block of size < b ll29l . More generally, large singular value clusters slow convergence, so any small 
block algorithm must have runtime dependence on the gaps between each adjacent pair of top k 
singular values We believe that obtaining simpler theoretical bounds for small block methods 
is an interesting direction for future work. 

3 Background and Intuition 

We will start by 1) providing background on algorithms for approximate singular value decom¬ 
position and 2) giving intuition for Simultaneous Power Iteration and Block Krylov methods and 
justifying why they can give strong gap-independent error guarantees. 

3.1 Frobenius Norm Error 

Progress on algorithms for Frobenius norm error low-rank approximation ([Til has been considerable. 
Work in this direction dates back to the strong rank-revealing QR factorizations of Gu and Eisenstat 
EH. They give deterministic algorithms that run in approximately 0(ndk) time, vs. Oindf) for a 
full SVD, but only guarantee polynomial factor Frobenius norm error. 

Recently, randomization has been applied to achieve even faster algorithms with (1 + e) error. The 
paradigm is to compute a linear sketch of A into very few dimensions using either a column sam¬ 
pling matrix or Johnson-Lindenstrauss random projection matrix IT. Typically AIT has at most 
poly(fc/e) columns and can be used to quickly find Z. Specifically, Z is typically taken to be the top 
k left singular vectors of AH or of A projected onto ATE Il^ l4l. 

^nxd ^ TEf^xpoly^/c/e) = (AIT) poly(fc/e) 

This approach was developed and refined in several pioneering results, including ll^ O [^ 
for column sampling, Ena for random projection, and definitive work by Sarlos 0. Recent 
work on sparse Johnson-Lindenstrauss type matrices ||6]|38][^ has significantly reduced the cost of 
multiplying ATI, bringing the cost of Frobenius error low-rank approximation down to O (nnz (A) -b 
n poly(fc/e)) time, where the first term is considered to dominate since typically k n,d. 

The sketch-and-solve method is very efficient - the computation of AH is easily parallelized and, 
regardless, pass-efficient in a single processor setting. Furthermore, once a small compression of A 
is obtained, it can be manipulated in fast memory to find Z. This is not typically true of A itself, 
making it difficult to directly process the original matrix at all. 

3.2 Spectral Norm Error via Simultaneous Iteration 

Unfortunately, as discussed, Frobenius norm error is often insufficient when A has a heavy singular 
value tail. Moreover, it seems an inherent limitation of sketch-and-solve methods. The noise from 
A’s lower r — k singular values corrupts ATI, making it impossible to extract a good partial SVD 
if the sum of these singular values (equal to ||A — Afc|||.) is too large. In other words, any error 
inherently depends on the size of this tail. 

In order to achieve spectral norm error (|2]i. Simultaneous Iteration must reduce this noise down to 
the scale of Ok+i = ||A — Afc|| 2 . It does this by working with the powered matrix A'? ||401I4T1 FI 
By the spectral theorem. A"? has exactly the same singular vectors as A, but its singular values are 
equal to the singular values of A raised to the q* power. Powering spreads the values apart and 
accordingly, A'^’s lower singular values are relatively much smaller than its top singular values (see 
Figure |2a] for an example). 

Specifically, q = is sufficient to increase any singular value > (1 -b e)ak+i to be signifi¬ 

cantly (i.e. poly(d) times) larger than any value < ak+i- This effectively denoises our problem - 
if we use a sketching method to find a good Z for approximating A® up to Frobenius norm error, Z 
will have to align very well with every singular vector with value > (1 -b e)ak+i- It thus provides 
an accurate basis for approximating A up to small spectral norm error. 

®For nonsymmetric matrices we work with (AA^)'^A, but present the symmetric case here for simplicity. 
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(a) A’s singular values compared to those of A^, 
rescaled to match on cri. Notice the significantly 
reduced tail after ag. 


(b) An 0(l/v^)-degree Chebyshev polynomial, 
To(i/^) (a;), pushes low values nearly as close to 
zero as while spreading higher values less 

significantly. 


Figure 2: Replacing A with a matrix polynomial facilitates higher accuracy approximation. 


Computing A‘‘ directly is costly, so A'^II is computed iteratively. We start with a random 11 and 
repeatedly multiply by A on the left. Since even a rough Frobenius norm approximation for A'^ 
suffices, n is often chosen to have just k columns. Each iteration thus takes 0(nnz(A)A:) time. 
After A'^n is computed, Z can simply be set to a basis for its column span. 

To the best of our knowledge, this approach to analyzing Simultaneous Iteration without dependence 
on singular value gaps began with m. The technique was popularized in IJ) and its analysis im¬ 
proved in ifTSl and iflhll . lfT4l gives the first bound that directly achieves (|2|i with 0(log d/e) power 
iterations. All of these papers rely on an improved understanding of the benefits of starting with a 
randomized 11, which has developed from work on the sketch-and-solve paradigm. 

3.3 Beating Simultaneous Iteration with Krylov Methods 

As mentioned, numerous papers hint at the possibility of beating Simultaneous Iteration with block 
Krylov methods lfT8l[T9ll28]| . In particular, [T], 1201 and ETI suggest and experimentally confirm the 
potential of a randomized variant of the Block Lanczos algorithm, which we refer to as Block Krylov 
Iteration (Algorithmic. However, none of these papers give theoretical bounds on the algorithm’s 
performance. 

The intuition behind Block Krylov Iteration matches that of many accelerated iterative methods. 
Simply put, there are better polynomials than A'^ for denoising tail singular values. In particular, 
we can use a lower degree polynomial, allowing us to compute fewer powers of A and thus leading 
to an algorithm with fewer iterations. For example, an appropriately shifted q — degree 

Chebyshev polynomial can push the tail of A nearly as close to zero as even if the long 

run growth of the polynomial is much lower (see Figure l2b]i. 

Block Krylov Iteration takes advantage of such polynomials by working with the Krylov subspace, 

K = [n An A^n A^n ... A«n], 

from which we can construct pq(A)n for any polynomial Pq{ ) of degree gQ Since an effective 
polynomial for denoising A must be scaled and shifted based on the value of ak+i, we cannot easily 
compute it directly. Instead, we argue that the very best k rank approximation to A lying in the span 
of K at least matches the approximation achieved by projecting onto the span of pq{A)Il. Finding 
this best approximation will therefore give a nearly optimal low-rank approximation to A. 

Unfortunately, there’s a catch. Perhaps surprisingly, it is not clear how to efficiently compute the 
best spectral norm error low-rank approximation to A lying in a specific subspace (e.g. K’s span) 
CSlIlll. This challenge precludes an analysis of Krylov methods parallel to the recent work on 

’ Algorithml^in fact only constructs odd powered terms in K, which is sufficient for our choice of pq (x). 
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Simultaneous Iteration. Nevertheless, we show that computing the best Frobenius error low-rank 
approximation in the span of K, exactly the post-processing step taken by classic Block Lanczos 
and our method, will give a good enough spectral norm approximation for achieving (1 -f e) error. 

3.4 Stronger Per Vector Error Guarantees 

Achieving the per vector guarantee of Q requires a more nuanced understanding of how Simultane¬ 
ous Iteration and Block Krylov Iteration denoise the spectrum of A. The analysis for spectral norm 
low-rank approximation relies on the fact that A'^ (or Pq{A) for Block Krylov Iteration) blows up 
any singular value > (1 -f e)(Tfe+i to much larger than any singular value < ak+i- This ensures that 
the Z outputted by both algorithms aligns very well with the singular vectors corresponding to these 
large singular values. 

If CTfc > (1 -f e)(Tfc+i, then Z aligns well with all top k singular vectors of A and we get good 
Frobenius norm error and the per vector guarantee Q. Unfortunately, when there is a small gap 
between ak and at+i, Z could miss intermediate singular vectors whose values lie between ak+i 
and (1 + e)ak+i- This is the case where gap dependent guarantees of classical analysis break down. 

However, A'^ or, for Block Krylov Iteration, some g-degree polynomial in our Krylov subspace, also 
significantly separates singular values > ak+i from those < (1 — e)ak+i- Thus, each column of Z 
at least aligns with A nearly as well as u^+i. So, even if we miss singular values between ak+i and 
(1 + e)ak+i, they will be replaced with approximate singular values > (1 — e)ak+i, enough for Q. 

For Frobenius norm low-rank approximation, we prove that the degree to which Z falls outside of 
the span of A’s top k singular vectors depends on the number of singular values between ak+i and 
(1 — e)ak+i- These are the values that could be ‘swapped in’ for the true top k singular values. Since 
their weight counts towards A’s tail, our total loss compared to optimal is at worst e|| A — Afe|||,. 


4 Preliminaries 


Before proceeding to the full technical analysis, we overview required results from linear algebra, 
polynomial approximation, and randomized low-rank approximation. 


4.1 Singular Value Decomposition and Low-Rank Approximation 


Using the SVD, we compute the pseudoinverse of A S as A+ = Additionally, 

for any polynomial p(a;), we define p{A) — Up(S)V'^. Note that, since singular values are always 
take to be non-negative, p(A)’s singular values are given by |p(S)|. 


Let Sfc be S with all but its largest k singular values zeroed out. Let Uj. and be U and V with 
all but their first k columns zeroed out. For any k, Ak = is the closest rank 

k approximation to A for any unitarily invariant norm, including the Frobenius norm and spectral 
norm ES. The squared Frobenius norm is given by || A|||. = ■ A^^ = tr(AA'^) = af. 

The spectral norm is given by || A ||2 = cti. 

||A-Afc||i;’= min HA-BHf and ||A-Afe|| 2 = min ||A-B|| 2 . 

B|rank(B)=fe B|rank(B)=/L 


We often work with the remainder matrix A—Ak and label it A^\k- Its singular value decomposition 
is given by A^\k = r\k'^r\k^r\k where Ur\fe. ^r\k^ have their first k columns zeroed. 


While the SVD gives a globally optimal rank k approximation for A, both Simultaneous Iteration 
and Block Krylov Iteration return the best k rank approximation falling within some fixed subspace 
spanned by a basis Q (with rank > k). For the Frobenius norm, this simply requires projecting A to 
Q and taking the best rank k approximation of the resulting matrix using an SVD. 

Lemma 2 (Lemma 4.1 of |[T4l l. Given A € and Q S with orthonormal columns, 

||A-(QQ^A)fc||;^ = ||A-Q(Q^A)Jl^= min ||A-QC||^. 

C|rank(C)=fe 

This low-rank approximation can be obtained using an SVD (equivalently, eigendecomposition) of 
the m X m matrix M = Q^(AA^)Q. Specifically, letting M = US^U^, then: 

(QUfc) (QUfc)^A = Q(Q^A)^. 
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If the SVD of Q^A gwen by Q^A_= USV^ then M = Q^(AA^)Q = US^U^. So 
Q (Q^A)^ = QtjfcSfeV^ = Q (Ufetj|’) USV^ = QU^U^Q^A, giving the lower matrix 
equality. Note that QU^ has orthonormal columns since U^Q^QU;. = U^IU;, = 1^,. 

In general, this rank k approximation does not give the best spectral norm approximation to A 
falling within Q ifT^ . A closed form solution can be obtained using the results of ll42l . which are 
related to Parrott’s theorem, but we do not know how to compute this solution without essentially 
performing an SVD of A. It is at least simple to show that the optimal spectral norm approximation 
for A spanned by a rank k basis is obtained by projecting A to the basis: 

Lemma 3 (Lemma 4.14 of lfT4ll '). For A S and Q £ with orthonormal columns, 

||A-QQ^A|| 2 =mm||A-QC|l 2 . 


4.2 Other Linear Algebra Tools 

Throughout this paper we use span(M) to denote the column span of the matrix M. We say that 
a matrix Q is an orthonormal basis for the column span of M if Q has orthonormal columns and 
QQ^M = M. That is, projecting the columns of M to Q fully recovers those columns. QQ^ is 
the orthogonal projection matrix onto the span of Q. (QQ )(QQ ) = QIQ = QQ . 

If M and N have the same dimension and = 0 then ||M + N|||. = ||M|||;, + ||N|||.. This 

matrix Pythagorean theorem follows from writing ||M + N|||. = tr((M + N)(M + N)'^). As an 
example, for any orthogonal projection QQ^ A, A^(I—QQ^)QQ^A = 0, so ||A—QQ^A|||. = 
II A|||. — ||QQ^A|||,. This implies that, since A^ = UfcU^’A minimizes || A — Afc|||. over all rank 
fc matrices, QQ^ = UfeUfe maximizes ||QQ^A|||. over all rank fc orthogonal projections. 

4.3 Randomized Low-Rank Approximation 

Our proofs build on well known sketch-based algorithms for low-rank approximation with Frobenius 

norm error. A short proof of the following Lemma is in AppendixiAl 

Lemma 4 (Frobenius Norm Low-Rank Approximation). Take any A £ and Yi 

the entries o/II are independent Gaussians drawn from Af(0, 1). If we let Z be an orthonormal basis 

for span (All), then with probability at least 99/100,/or some fixed constant c, 

||A-ZZ^A|||,<c-dfc||A-Afc|||. 

For analyzing block methods, results like Lemma|4]can effectively serve as a replacement for earlier 
random initialization analysis that applies to single vector power and Krylov methods ll44l . 

4.4 Chebyshev Polynomials 

As outlined in Section l331 our proof also requires polynomials to more effectively denoise the tail of 
A. As is standard for Krylov subspace methods, we use a variation on the Chebyshev polynomials. 
The proof of the following Lemma is relegated to AppendixiAl 

Lemma 5 (Chebyshev Minimizing Polynomial). Given a specified value a > 0, gap 7 £ (0,1], 
and q>l, there exists a degree q polynomial p{x) such that: 

1. p{{\ + y)a) = (1 + y)a 

2. p{x) > X for all a: > (1 -f 7)0 

3. |p(a:)| < for all X £ [0,0;] 

Furthermore, when q is odd, the polynomial only contains odd powered monomials. 

5 Implementation and Runtimes 

We first briefly discuss runtime and implementation considerations for Algorithms [T] and |2l our 
randomized implementations of Simultaneous Power Iteration and Block Krylov Iteration. 
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5.1 Simultaneous Iteration 

Algorithm [U can be modified in a number of ways. 11 can be replaced by a random sign matrix, or 
any matrix achieving the guarantee of Lemma |4] 11 may also be chosen with p > k columns. We 
will discuss in detail how this approach can give improved accuracy in Section|7] 

In our implementation we set Z = QU^. This ensures that, for all Z < fc, Z; gives the best rank 
I Frobenius norm approximation to A within the span of K (See Lemma |2]). This is necessary 
for achieving per vector guarantees for approximate PCA. However, if we are only interested in 
computing a near optimal low-rank approximation, we can simply set Z = Q. Projecting A to 
Qtik is equivalent to projecting to Q as these two matrices have the same column spans. 

Additionally, since powering A spreads its singular values, K = (AA^)'^An could be poorly 
conditioned. As suggested in ||45]| . to improve stability we can orthonormalize K after every iteration 
(or every few iterations). This does not change K’s column span, so it gives an equivalent algorithm 
in exact arithmetic, but improves conditioning significantly. 

Theorem 6 (Simultaneous Iteration Runtime). Algorithm\I\mns in time 

^ f / * , nk'^logd 

V e e 


Proof. Computing K requires first multiplying A by 11, which takes 0(nnz(A)fc) time. Comput¬ 
ing (AA^)* An given (AA^)* ^ AH then takes 0(nnz(A)fc) time to first multiply our (n x k) 
matrix by A^ and then by A. Reorthogonalizing after each iteration takes 0{nk‘^) time via Gram- 
Schmidt or Householder reflections. This gives a total runtime of 0(nnz(A)fc(7 + nkfq) for com¬ 
puting K. 

Finding Q takes 0(nk^) time. Computing M by multiplying from left to right requires 
0{nnz{A)k fr nk'^) time. M’s SVD then requires O(k^) time using classical techniques. Finally, 
multiplying by Q takes time O(nk^). Setting q = 0(logd/e) gives the claimed runtime. □ 


5.2 Block Krylov Iteration 


As with Simultaneous Iteration, we can replace H with any matrix achieving the guarantee of 
Lemma |4] and can use p > k columns to improve accuracy. Q can also be computed in a num¬ 
ber of ways. In the traditional Block Lanczos algorithm, one starts by computing an orthonormal 
basis for AH, the first block in the Krylov subspace. Bases for subsequent blocks are computed 
from previous blocks using a three term recurrence that ensures Q^AA^Q is block tridiagonal, 
with k X k sized blocks ||T9l . This technique can be useful if qk is large, since it is faster to compute 
the top singular vectors of a block tridiagonal matrix. However, computing Q using a recurrence 
can introduce a number of stability issues, and additional steps may be required to ensure that the 
matrix remains orthogonal Il29ll . 

An alternative is to compute K explicitly and then compute Q using a QR decomposition. This 
method is used in [I] and IMl . It does not guarantee that Q^AA^Q is block tridiagonal, but helps 
avoid a number of stability issues. Furthermore, if qk is small, taking the SVD of Q^AA^Q will 
still be fast and typically dominated by the cost of computing K. 

As with Simultaneous Iteration, we can also orthonormalize each block of K after it is computed, 
avoiding poorly conditioned blocks and giving an equivalent algorithm in exact arithmetic. 

Theorem 7 (Block Krylov Iteration Runtime). Algorithm^runs in time 


O 


^nnz(A) 


fclogd 


log^ d 
e 


log^d\ 
e3/2 ) • 


Proof. Computing K, including block reorthogonalization, requires 0(nnz(A)fcq -f nk^q) time. 
The remaining steps are analogous to those in Simultaneous Iteration except somewhat more costly 
as we work an fc • q dimensional rather than k dimensional subspace. Finding Q takes 0{n{kq)^) 
time. Computing M take 0{nnz{A){kq) + nfkqY) time and its SVD then requires 0{{kqY) time. 
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Finally, multiplying Ufe by Q takes time 0(nk(kq)). Setting q = 0(logd/-ye) gives the claimed 
runtime. □ 

6 Error Bounds 

We next prove that both Algorithms [T] and |2] return a basis Z that gives relative error Frobenius O 
and spectral norm (|2]l low-rank approximation error as well as the per vector guarantees Q. 

6.1 Main Approximation Lemma 

We start with a general approximation lemma, which gives three guarantees formalizing the intuition 
given in Section[3] All other proofs follow nearly immediately from this lemma. 

For simplicity we assume that k < r = rank(A) < n, d. However, if fc > r it can be seen that both 
algorithms still return a basis satisfying the proven guarantees. We start with a definition: 

Definition 8. For a given matrix Z S with orthonormal columns, letting Z; S be the 

first I columns o/Z, we define the error function: 

£{7.uA) = \\Ai\\l-\\ZiZjA\\l 

= \\A-ZiZjA\\l-\\A-Ai\\l. 

Recall that A/ is the best rank I approximation to A. This error function measures how well ZiZf A 
approximates A in comparison to the optimal. 

Lemma 9 (Main Approximation Lemma). Let m be the number of singular values ai of A with 
cTi > (1 + e/2)(Tfe_|_i. Let w be the number of singular values with if CTi < Uk- With 

probability 99/100 Algorithms\I]and^return Z satisfying: 

1. Vl<m,£{Zi,A)<{e/2)-al^,, 

2. \fl<k,£iZi,A) <£{Zi_i,A)+3e-al^^, 

3. Ml < k, £{Zi,A) < (lu -b 1) • 3e • 

Property [T]captures the intuition given in Section [T2l Both algorithms return Z with Z; equal to the 
best Frobenius norm low-rank approximation in span(K). Since cti > ... > am > (1 + e/2)crfc+i 
and our polynomials separate any values above this threshold from anything below ak+i, Z must 
align very well with A’s top m singular vectors. Thus £{Zi, A) is very small for all I < m. 

Property |2] captures the intuition of Section l3Al - outside of the largest m singular values, Z still 
performs well. We may fail to distinguish between vectors with values between j:p^ak and (1 -b 

e/2)crfc+i. However, aligning with the smaller vectors in this range rather than the larger vectors can 
incur a cost of at most 0{e)a‘f.j^^. Since every column of Z outside of the first m may incur such a 
cost, there is a linear accumulation as characterized by Property|2] 

Finally, Property [3 captures the intuition that the total error in Z is bounded by the number of 
singular values falling in the range j^^ak < < ak- This is the total number of singular vectors 

that aren’t necessarily separated from and can thus be ‘swapped in’ for any of the {k — m) true 
top vectors with singular value < (1 -b e/2)ak+i- Property |3is critical in achieving near optimal 
Frobenius norm low-rank approximation. 

Proof. Proof of Property [I] 

Assume m > 1. If m = 0 then Property[T]trivially holds. We will prove the statement for Algorithm 
12] since this is the more complex case, and then explain how the proof extends to Algorithm]!] 

Let Pi be the polynomial from Lemma|5]with a = ak+i, 7 = e/2, and q > c\og{d/e)/y/e for 
some fixed constant c. We can assume 1/e = 0(poly d) and thus q = 0{logd/^). Otherwise our 
Krylov subspace would have as many columns as A and we may as well use a classical algorithm 
to compute A’s partial SVD directly. Let Yi G be an orthonormal basis for the span of 
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Pi(A)n. Recall that we defined pi(A) = Upi(S)V^. As long as we choose q to be odd, by 
the recursive definition of the Chebyshev polynomials, pi (A) only contains odd powers of A (see 

/ s(i-l)/2 

LemmalSJl. Any odd power i can be evaluated as ( AA 1 A. Accordingly, pi (Ajll and thus 

Yi have columns falling within the span of the Krylov subspace from Algorithm |2] (and hence its 
column basis Q). 

By Lemma|4]we have with probability 99/100: 

Ibi(A) - YiYfpi(A)||| < c(ifc|bi(A) -pi(A)fe|||. (4) 

Furthermore, one possible rank k approximation of pi (A) is pi (A^). By the optimality of pi (A)^, 


Ibi(A)-pi(A)fe||^ < Ibi(A)-pi(Afc)|||. < ^ Pi{(Tif 




< d 


-( 


>k+l 


y 22 <j\/^-2 


= o( 


/ e 


V2d2 


^k+l 


The last inequalities follow from setting q = 0(log(d/e)/ ^/e) and from the fact that ai < crk+i = a 
for all i > k + 1 and thus by property 3 of Lemma|2 bi (^*)l ^ ,,^-1 ■ Noting that k < d, we 
can plug this bound into Q to get 


|bi(A)-YiYfpi(A)||^< 


(5) 


Applying the Pythagorean theorem and the invariance of the Frobenius norm under rotation gives 

|bi(S)f^ - ^ < ||YiY/’Upi(S)||^. 

Yi falls within A’s column span, and therefore U’s column span. So we can write Yi = UC for 
some C G Since Yi and U have orthonormal columns, so must C. We can now write 


lbi(s)|||- 


ecr; 


fe + 1 


< IIUCC^ U' Upi(S)|b = ||UCC'pi(S)|l^ = ||CVi(S)||^. 


Letting be the row of C, expanding out these norms gives 




_ ,2 ^^k+1 

^ 


< YI 11^*112^1(0-* 


(6) 


Since C’s columns are orthonormal, its rows all have norms upper bounded by 1. So ||ci|| 2 Pi(cri)^ < 
Pi(o'i)^ for all i. So for all I < r, ^ gives us 


^(1 - \\c^\\^)pi{(Tif < ^(1 - \\ci\\l)pi{aif < 


ea 


/c+l 


Recall that m is the number of singular values with (Ji > (1 + e/2)(jfc+i. By Property 2 of Lemma 
[21 for alH < m we have ai < pi(ai). This gives, for all I < m: 


Ea- 

i 

E-*^- 




ea 


fe+i 


ea 


fc+i 


^Ei 


and so 




Converting these sums back to norms yields ||Si|||- 
^ < IIYiYf A/|2 and 


- ^ < lic^s 


and therefore ||A;|||, - 


|A,|||-||YiYfA,||^< 


(7) 
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Now Y 1 Y^ Ai is a rank I approximation to A falling within the column span of Y and hence within 
the column span of Q. By Lemma|2] the best rank I Frobenius approximation to A within Q is given 
by QU;(QU;)^A. So we have 


IIAzlll - IIQUiCQUifAIII = £iZi,A) < 


ea 


2 

fe+i 


2 


giving Property [T] 


/ \2g+l 

For Algorithm [U we instead choose pi{x) = (1 + ej2)ak+i ■ ( (i+e/ 2 )iTk+i ) ' ^ ~ 

0(logd/e), this polynomial satisfies the necessary properties: for all i > k + 1, Pi{a-i) < 
O and for all i < m, ai < pi{ai). Further, up to a rescaling, pi(A)n = K so Yi 

spans the same space as K. Therefore since Algorithm [T] returns Z with Z; equal to the best rank I 
Frobenius norm approximation to A within the span of K, for all I we have: 


WQtJiiQtJifAWl > IIYiYf A,f^ > IIAilll 


eo'fc+i 

2 


giving the proof. 


Proof of Property |2] 

Property [Hand the fact that £’(Z;, A) is always positive immediately gives Property [2] for I < m. So 
we need to show that it holds for m < I < k. Note that if w, the number of singular values with 
1 + 6/2 ^ ^ is equal to 0, then au+i < so m = k and we are done. So we assume 

w > 1 henceforth. Again, we first prove the statement for Algorithm [2] and then explain how the 
proof extends to the simpler case of Algorithm[T] 

Intuitively, Property [T] follows from the guarantee that there is a rank m subspace of span(K) that 
aligns with A nearly as well as the space spanned by A’s top m singular vectors. To prove Property 
|2]we must show that there is also some rank k subspace in span(K) whose components all align 
nearly as well with A as u^, the fc* singular vector of A. The existence of such a subspace ensures 
that Z performs well, even on singular vectors in the intermediate range [cr^, (1 + e/2)crfc+i]. 

Let p 2 be the polynomial from Lemma|5]with a = 7 = e/2, and q > c log(d/e )/for 

some fixed constant c. Let Y 2 G be an orthonormal basis for the span of p 2 (A)n. Again, 

as long as we choose q to be odd, P 2 (A) only contains odd powers of A and so Y 2 falls within the 
span of the Krylov subspace from Algorithm|2] We wish to show that for every unit vector x in the 
column span of Y 2 , ||x^A ||2 > 

Let Ainner = A^\fc - Ainner = Ul^inner'V'^ where Hinner Contains Only the singular 

values CTfc+i,... ,ak+w These are the w intermediate singular values of A falling in the range 

l+e/2^fe: ■ Let -^outer — -dinner — USouterV^. S outer contains all large singular 

values of A with at > ak and all small singular values with at < 

Let Yinner S oithonormal basis for the columns of p 2 (Ai„„er)n. Similarly 

let Youter- e be an orthonormal basis for the columns of p 2 (Aouter)n. 

Every column of Yi„„er falls in the column span of Ainner and hence the column span of XJinner G 
which contains only the singular vectors of A corresponding to the inner singular values. 
Similarly, the columns of Youter fall within the span of XJ outer G which contains the re¬ 

maining left singular vectors of A. So the columns of Y^ner are orthogonal to those of Yo„ter and 
[Yinner,Youter] forms an orthogonal basis. For any unit vector x £ span(p 2 (A)n) = span{Y 2 ) 
we can write x = dinner + Pouter where dinner and Xouter are orthogonal vectors in the spans of 
Yinner and Youter respectively. We have: 

||x^A||^ = l|x^„erA||2 + [[^outerMll (8) 


We will lower bound ||x^A||| by considering each contribution separately. First, any unit vector 
x' £ R" in the column span of Yinner can be written as x' = XJ inner ^ where z £ R’" is a unit 
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vector. 


||x''^A|| 2 = z'^Uf„„^^AA'^Ui„„erZ = > (1 - e)al. (9) 

Note that we’re abusing notation slightly, using 'Eisner G to represent the diagonal matrix 

containing all singular values of A with < (Ji < ak without diagonal entries of 0. 

We next apply the argument used to prove Property[T]to p 2 {Aouter)^- The {k + 1)* singular value 
of Aouter is equal to crfe+u,+i < = a. So applying O we have for all I < k, 

IIAilll - II (Youter)i iYouter)J (10) 

Note that Aouter has the same top k singular vectors at A so (Aouter)i = A;. Let x' S R" be any 

unit vector within the column space of Youter and let Youter = (I ~ ^'^''^)Youter, i-e the matrix 

with x' projected off each column. We can use (fTOl i and the optimality of the SVD for low-rank 

approximation to obtain: 

||A,f^-||Y„„*,,YLe.A,f^<^ 

IIAfclll - ||Y„„,,,yL,,A,|||. - ||xV"’a,||^ < ^ 

||Afc||?.-||A,_if^-^<||xV^Afc||| 

(l-e/2)a2<||x'^A||i (11) 


Plugging (|9]l and (fTTl i into ® yields that, for any x in span{Y 2 ), i.e. span(p 2 (A)n), 

II^^A||2 = ||x^„g^A||2 + ||Xo„(g^A||2 

> (||x™„e^||2 -f ||xo„ter||2) (1 - ejcTfc > (1 - e)cr^ (12) 

So, we have identified a rank k subspace Y 2 within our Krylov subspace such that every vector in 
its span aligns at least as well with A as u^. 


Now,foranyTO < I < k, consider iS(Z;, A). We know that given Z/_i, we can form a rank Z matrix 
Z; in our Krylov subspace simply by appending a column x orthogonal to the I — 1 columns of Z/_i 
but falling in the span of Y 2 . Since Y 2 has rank k, finding such a column is always possible. Since 
Z; is the optimal rank I Frobenius norm approximation to A falling within our Krylov subspace, 

£:(Z,, A) < £(%,A) = IIAzlll - ||Z,zfA||l. 

= af + ||A,_i||^ - ||Z,_iZr_iAf^ - llxx^Alll 
= £(Zi_i,A) + af - ||xx^A|||. 

< £(Zi_i, A) -I- (1 -I- e/2)^cr^_|_j — (1 — e)o'fc+i 

< iS(Z;_i, A) + 3e • cTfc+i, 

which gives Property |3 


/ \2g+l 

Again, a nearly identical proof applies for Algorithm [1] We just choose P 2 (a:) = U/t ( ^ 1 
For q — 0(logcZ/e) this polynomial satisfies the necessary properties: for all i > k, pi{(Ti) < 
O (^crl) and for all i<k,ai< P 2 (cri). 


Proof of Property 12 

By Properties [T| and |2] we already have, for all I < fc, £iZi,A) < + (l -m) ■ 3ecr^+i < 

(1 + fc — m) • 3e • So if fc — m < tu then we immediately have Property [2 

Otherwise, w<k — m so w<k and thus p 2 (Ainner)T^ S only has rank w. It has a null 

space of dimension k — w. Choose any z in this null space. Then p 2 (A)nz = p 2 (Ai„„er)nz + 
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P2(Ao„ter)nz = p2{Aouter)^2,. In Other words, p2(A)IIz falls entirely within the spanof Youte^. 
So, there is a fc — ui dimensional subspace of span(Y2) that is entirely contained in span{Y outer) ■ 

For I < m + w, then Properties [U and |2] already give us £{Zi, A) < + {I — m) ■ Secr^^^ < 

(tu +1) • 3e • So consider m + w < I < k. Given Z^, to form a rank I matrix Z/ in our Krylov 
subspace we need to append I — m orthonormal columns. We can choose min{fc — w — m,l — m} 
columns, Xi, from the k — w dimensional subspace within span (Y 2 ) that is entirely contained in 
span{Y outer) ■ If necessary (i.e. k — w — m < Z —m). We can then choose the remaining Z—(fc —w) 
columns X 2 from the span of Y 2 . 

Similar to our argument when considering a single vector in the span of A outer 7 letting Y outer = 
(I - XiXf) Y outer, we have by ([TOli; 


2 

IIAfcll^ - \^outerYl^,M\l - llXiXf Afell^ < ^ 

2 

||Afc||^ ~ ^ < ||XiX^Afc|||, 

^ a2-^<||XiXfA|||. 

i—k—min{k—w — m,l — m}-\-l 

By applying (fT^ directly to each column of X 2 we also have; 

{I + w — k)a1 — {I + w — k)eal < ||X 2 X^A|||. 

[l + w- k)al^, -il + w- k)eal^, < ||X 2 X^Af^. 

Assume that min{fc — w — m,l — m} = k — w — m. Similar calculations show the same result when 
min{fc — w — m,l — m} = I — m. We can use the above two bounds to obtain: 

£{Zi,A)<£{%,A) 

= \\At\\jr-mzjA\\% 

I 

= ^f + l|A™||?.-||Z^Z^A|||,-||XiXfA||?,-||X2X^A|||, 

2=m+l 

I k 2 

<£(Zm,A)+ ^ a^- ^ a'f +—^ — {I + w — k)al_^_l + {I + w — k)€(7l_^_l 

m-\-w 

< a1- waY + {l + w-k + Z/2)eaY 

< (Z + 3w — fc + 3/2)e(T^_|_^ 

< (ui + l)-3e-crfe+i, 

giving Property |3] for all Z < fc. □ 

6.2 Error Bounds for Simultaneous Iteration and Block Krylov Iteration 

With Lemma|9]in place, we can easily prove that Simultaneous Iteration and Block Krylov Iteration 
both achieve the low-rank approximation and PCA guarantees ([T]i, ©. 

Theorem 10 (Near Optimal Spectral Norm Error Approximation). With probability 99/100, Algo- 
rithms\I}and]2\return Z satisfying (|2|l.' 

||A-ZZ^A ||2 < (l + e)||A-Afe|| 2 . 
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Proof. Let TO be the number of singular values with (Ti > {l + e/2)ak+i- If to = 0 then we are done 
since any Z will satisfy ||A — ZZ^A ||2 < ||A ||2 = ci < (1 + e/2)crfc+i < (1 + e)||A — Afc|| 2 . 
Otherwise, by Property[T]of Lemma|9] 

£:(Z„,A)< ^ 

2 

||A - Z^Z^AII^ < ||A - 

Additive etTor in Frobenius norm directly translates to additive spectral norm error. Specifically, 
applying Theorem 3.4 of ll22l . which we also prove as LemmaffSlin AnnendixlAl 

||A - Z^Z^A||2 < IIA - A™||2 + 

2 

< (1 + €/2)al_^_i H < (1 + e)l|A — A^H^. (13) 

Finally, Z^Z^A = ZZ^A and so by Lemma[3]we have ||A — ZZ^AHj < ||A — ZmZ^A|||, 
which combines with (fOl) to give the result. □ 

Theorem 11 (Near Optimal Frobenius Norm Error Approximation). With probability 99/100, AL 
gorithms\I]and^return Z satisfying ([T]).' 

||A-ZZ^A||f< (l + e)||A-Afe||f. 

Proof By Property|3]of Lemma|9]we have; 

iS(Zi, A) < (ui + 1) • 3e • cTfc+i 

||A - ZZ^AIII < ||A - Ak\\% + (re + 1) • 3e • al+^. (14) 

w is defined as the number of singular values with < ai < (Tfe. So ||A - Ak\\% > 

w ■ ^ i 4 .g/ 2 ^fc) ■ into (fl4l i we have: 

||A — ZZ^A||p < ||A — Afcllp + (ru + 1) • 3e • < (1 + 10e)||A — A^Hl^. 

Adjusting constants on the e gives us the result. □ 

Theorem 12 (Per Vector Quality Guarantee). With probability 99/100, Algorithms\T\ond^return 
Z satisfying ©.• 

Vi, [ufAA’^Ui — zf AA^Zi| < ecr^+i. 

Proof First note that zfAA'^Zi < u^AA^u^. This is because zfAA^z^ = 
zf QQ^AA’^QQ^Zi = CTi(QQ^A)^ by our choice of z^. CTi(QQ^A)^ < cri(A)^ since ap¬ 
plying a projection to A will decrease each of its singular values (which follows for example from 
the Courant-Fischer min-max principle). Then by Property |2]of Lemma|9]we have, for all i < k, 

WMWl - WZ^ZfWl < WM-ifj, - ||Z,_iZf_if^ + 3eal+, 
o-/ < ||z,zfA|||-f 3eCT^+i = zfAA^Zi+3ecr^+i. 

af = u/’AA’^Ui, so simply adjusting constants on e gives the result. □ 

7 Improved Convergence With Spectral Decay 

In addition to the implementations of Simultaneous Iteration and Block Krylov Iteration given in 
Algorithms [1] and |2] our analysis applies to the common modification of running the algorithms 
with n e forp > fc milolEl- This technique can significantly accelerate both methods for 
matrices with decaying singular values. For simplicity, we focus on Block Krylov Iteration, although 
as usual all arguments immediately extend to the simpler Simultaneous Iteration algorithm. 
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In order to avoid inverse dependence on the potentially small singular value gap — 1, the num¬ 
ber of Block Krylov iterations inherently depends on 1 /This ensures that our matrix polynomial 
sufficiently separates small singular values from larger ones. However, when (7^ > (1 + e)crfc+i we 

can actually use q — Q ^log(c?/e)/y/minll, — I}^ iterations, which is sufficient for separat¬ 
ing the top k singular values significantly from the lower values. Specifically, if we set a = ak+i 
and 7 = — 1, we know that with q = 0 ^log((j/e)/Y/min{l, — 1}^, (|5]l still holds. We 

can then just follow the proof of Lemma|9]and show that Property [T] holds for all I < k (not just for 
Z < m as originally proven). This gives Property|2]and Property [3 trivially. 

Further, for p > k, the exact same analysis shows that q = Q ^log(d/e)/Y/min{l, — 1}^ 

suffices. When A’s spectrum decays rapidly, so < c ■ au for some constant c < 1 and some 
p not much larger than k, we can obtain significantly faster runtimes. Our e dependence becomes 
logarithmic, rather than polynomial; 

Theorem 13 (Gap Dependent Convergence). With probability 99/100,/or any p > k, Algorithm\I\ 
o reinitialized with IT ~ //"(O, returns Z satisfying guarantees ([T]), (|2|l, and da as long as we 

set q = Q (\og{d/e)/ ^min{l, — 1})) or© ^log(cZ/e)/^min{l, — 1}^, respectively. 

This theorem may prove especially useful in practice because, on many architectures, multiplying 
a large A by 2k or even lOfc vectors is not much more expensive than multiplying by k vectors. 
Additionally, it should still be possible to perform all steps for post-processing K in memory, again 
limiting additional runtime costs due to its larger size. 

Finally, we note that while Theorem [T3] is more reminiscent of classical gap-dependent bounds, it 
still takes substantial advantage of the fact that we’re looking for nearly optimal low-rank approxi¬ 
mations and principal components instead of attempting to converge precisely to A’s true singular 
vectors. This allows the result to avoid dependence on the gap between adjacent singular values, 
instead varying only with which should be much larger. 


8 Experiments 

We close with several experimental results. A variety of empirical papers, not to mention widespread 
adoption, already justify the use of randomized SVD algorithms. Prior work focuses in particular on 
benchmarking Simultaneous Iteration ll20l fl^ and, due to its improved accuracy over sketch-and- 
solve approaches, this algorithm is popular in practice iniini. As such, we focus on demonstrating 
that for many data problems Block Krylov Iteration can offer significantly better convergence. 

We implement both algorithms in MATLAB using Gaussian random starting matrices with exactly 
k columns. We explicitly compute K for both algorithms, as described in Section |5] and use re¬ 
orthonormalization at each iteration to improve stability ll45l . We test the algorithms with varying 
iteration count q on three common datasets, SNAP/amazon0302 1^12411 . SNAP/email-Enron 
Il23ll46l . and 20 NEWSGROUPS ll25l . computing column principal components in all cases. We plot 
error vs. iteration count for metrics ([TJ, (l2]l, and Q in Figure[3 For per vector error Q, we plot the 
maximum deviation amongst all top k approximate principal components (relative to ak+i). 

Unsurprisingly, both algorithms obtain very accurate Frobenius norm error, || A — ZZ^AUj’/H A — 
Afclli?, with very few iterations. This is our intuitively weakest guarantee and, in the presence of a 
heavy singular value tail, both iterative algorithms will outperform the worst case analysis. 

On the other hand, for spectral norm low-rank approximation and per vector error, we confirm that 
Block Krylov Iteration converges much more rapidly than Simultaneous Iteration, as predicted by 
our theoretical analysis. It it often possible to achieve nearly optimal error with < 8 iterations where 
as getting to within say 1% error with Simultaneous Iteration can take much longer. 

The final plot in Figure |3] shows error verses runtime for the 11269 x 15088 dimensional 20 NEWS- 
GROUPS dataset. We averaged over 7 trials and ran the experiments on a commodity laptop with 
16GB of memory. As predicted, because its additional memory overhead and post-processing costs 
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(a) SNAP/AMAZON0302, fc = 30 


(b) SNAP/email-Enron, fe = 10 



(c) 20 Newsgroups, fc = 20 



Runtime (seconds) 


(d) 20 Newsgroups, k = 20, runtime cost 


Figure 3; Low-rank approximation and per vector error convergence rates for Algorithms [T] and 


are small compared to the cost of the large matrix multiplication required for each iteration. Block 
Krylov Iteration outperforms Simultaneous Iteration for small e. 

More generally, these results justify the importance of convergence bounds that are independent of 
singular value gaps. Our analysis in Section|7]predicts that, once e is small in comparison to the gap 
— 1, we should see much more rapid convergence since q will depend on log(l/e) instead of 

1/e. However, for Simultaneous Iteration, we do not see this behavior with SNAP/amazon 0302 
and it only just begins to emerge for 20 NEWSGROUPS. 

While all three datasets have rapid singular value decay, a careful look confirms that their singular 
value gaps are actually quite small! For example, — 1 is .004 for SNAP/amazon 0302 and 
.011 for 20 Newsgroups, in comparison to .042 for SNAP/email-Enron. Accordingly, the 
frequent claim that singular value gaps can be taken as constant is insufficient, even for small e. 
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A Appendix 

Frobenius Norm Low-Rank Approximation 

We first give a deterministic Lemma, from which the main approximation result follows. 

Lemma 14 (Special case of Lemma 4.4 of fiAl . originally proven in Gl). Let A € have SVD A = 

USV^, lets € be any matrix such that rank (V^S) = k, and let C G R"^*^ be an orthonormal basis 

for the column span o/AS. Then: 

||A - CC^AIll < ||A - A,\\l + ||(A - Afc) S (v^s)"’ |||. 
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Leinina|4](Frobenius Norm Low-Rank Approximation). ForanyA £ andU £ where the entries 
o/n are independent Gaussians drawn from Af{0, 1). If we let 7i be an orthonormal basis for span (All), 
then with probability at least 99/100, for some fixed constant c, 

||A - ZZ^A|||. < c • dfc||A - Afc|||.. 

Proof. We follow 1141 . Apply Lemma [T4l with S = IT. With probability 1, V^S has full rank. So, to show 
the result we need to show that || (A — A^) S (V^ S)^ ||f’ < c||A — Ak\\% for some fixed c. For any two 
matrices M and N, ||MN||f < ||M||f||N|| 2 . This property is known as spectral submultiplicativity. Noting 
that ||Ur\fcSryfcII f = ||A — A/cHf and applying submultiplicativity, 

II (A-A,)s(v^s)’*’||| < ||U.y,S.y,||l.||V^\,S||i|| (v^s)^iii. 

By the rotational invariance of the Gaussian distribution, since the rows of are orthonormal, the entries 
of V^S and are independent Gaussians. By standard Gaussian matrix concentration results (Fact 

6 of m, also in EH), with probability at least 99/100, ||V)rv^j,S||| < ci • max{fc,r — k} < cid and 
II (v^s) + III < C 2 k for some fixed constants ci, C 2 . So, 

||U,\fcS,y,|||||V^\feS||||| (v^s)^ III < c- dfc||A - Afclll, 

for some fixed c, yielding the result. Note that we choose probability 99/100 for simplicity - we can obtain a 
result with higher probability by simply allowing for a higher constant c, which in our applications of Lemma 
|4]will only factor into logarithmic terms. □ 

Chebyshev Polynomials 

Lemma IHfChebyshev Minimizing Polynomial). Given a specified value a > 0, gap 7 £ (0,1], and q > 1, 
there exists a degree q polynomial p{x) such that: 

1. p((l -b 7 ) 0 ) = (1 + 7)0 

2. p{x) > xfor all X > {1 + 7)0 

3- |p(a:^)l < 3,^-1 for all x £ [0, a] 

Furthermore, when q is odd, the polynomial only contains odd powered monomials. 

Proof. The required polynomial can be constructed using a standard Chebyshev polynomial of degree q, Tq {x), 
which is defined by the three term recurrence: 

Mx) = 1 

ri(a:) = X 

Tq{x) = 2xTq-\{x) — Tq- 2 {x) 

Each Chebyshev polynomial satisfies the well known property that Tq{x) < 1 for all 2 : £ [—1,1] and, for 
3 ; > 1, we can write the polynomials in closed form l48l : 

Tq{x) = ix+v^^r + jx-v^^r^ 

For Lemma|^ we simply set: 

p(a:) = ( 1 -b 7 )a^^^^, (16) 

which is clearly of degree q and well defined since, referring to i fTSt . Tq{x) > 0 for all ® > 1. Now, 
p((l -b 7)a) = (1 -b 7 ) 0 : ^ = (1 + 7)a, 

so p{x) satisfies property 1. With property 1 in place, to prove that p{x) satisfies property 2, it suffices to show 
that p'{x) > 1 for all 2 ; > (1 -b 7 )a. By chain rule, 

p'{x) = t) p''( xja). 

’ 1 /( 1 + 7 ) 
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Thus, it suffices to prove that, for all a; > (1 + 7 ), 


(l+7)T'(a:)>r,(l+7)- (17) 

We do this hy showing that (1 + '))Tq (1 + 7 ) > Tq(l + 7 ) and then claim that Tq{x) > 0 for all a; > (1 + 7 ), 
so l ll7t holds for a; > (1 + 7 ) as well. A standard form for the derivative of the Chehyshev polynomial is 

^{ 2 q{T^-i+T ^-3 +...+Tt_) if g is even, 

^ I 2g {Tq—i + Tq—3 -f ... + T 2 ) + g if g is odd. 


can be verified via induction once noting that the Chehyshev recurrence gives Tq = 2xTq_i + 2Tq-i — 
Tq_ 2 - Since Ti{x) > 0 when x > 1, we can conclude that Tq{x) > 2qTq-\{x). So proving J17t for 
X = (1 + 7 ) reduces to proving that 


(l + 7)2gT,_i(l + 7 )>T,(l + 7 ). (19) 

Noting that, for a; > 1, (x + \/x'^ — 1) > 0 and {x — — 1) > 0, it follows from i fTst that 

Tg_i(x) ^(x + — 1) + {x— — 1)) > Tq{x), 


and thus 


Tq-i{x) 


< 2x. 


So, to prove GU, it suffices to show that 2(1 + 7 ) < (1 + 7 ) 2 g, which is true whenever g > 1. So Gill holds 
for all X = (1 + 7 ). 

Finally, referring to Gl, we know that T" must be some positive combination of lower degree Chehyshev 
polynomials. Again, since Ti{x) > 0 when x > 1, we conclude that Tq{x) > 0 for all x > 1. It follows 
that Tq[x) does not decrease above x = (1 + 7 ), so ( 117b also holds for all x > (1 + 7 ) and we have proved 
property 2 . 

To prove property 3, we first note that, by the well known property that Ti(x) < 1 for x G \—l,l\,Tq{x/a) < 
1 for X € [0, a]. So, to prove p{x) < , we just need to show that 


1 


< 


7)3(1+ 7) “ 2'Jv^-i' 
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Equation dlSb givesT 9 (l+ 7 ) > i(l+ 7 +\/(l + 7 )^ — 1 )"^ > When 7 < 1 , (1 + ^ 7 )^^'^ > 2 . 

Thus, (1 + Dividing by 2 gives Tq(l + 7 ) > , which gives ( 120b and thus property 3. 

Finally, we remark that it is well known that odd degree Chehyshev polynomials of the first kind only contain 
monomials of odd degree (and this is easy to verify inductively). Accordingly, since Pq{x) is simply a scaling 
of Tq(x), if we choose g to be odd, Pq{x) only contains odd degree terms. □ 


Additive Frobenius Norm Error Implies Additive Spectral Norm Error 

Lemma 15 (Theorem 3.4 of l22l ). For any A € let B G R"^‘* be any rank k matrix satisfying 

IIA — B|||. < IIA — Afclll- + 77. Then 

||A — BII 2 ^ ||A — Afc||2 F P- 


Proof. We follow the proof given in (23 nearly exactly, including it for completeness. By WeyTs monotonicity 
theorem (Theorem 3.2 in (221), for any two matrices X, Y G R"^'^ with n > d, for all i, j with i + j — 1 < n 
we have (Ji+j_i(X + Y) < CTi(X) + aj(X). If we write A = (A — B) + B and apply this theorem, then for 
all 1 > i > n — k, 


(Ti+fe(A) < ai(A - B) + (Tfc+i(B). 


21 











Note that if n < d, we can just work with and B^. Now, (Tfc+i(B) = 0 since B is rank k. Using the 
resulting inequality and recalling that || A — Ak ||f = '^hat: 

||A — B||jr ^ ||A — Afc||j? + 7] 

n n 

^a|(A-B)< af{A) + rj 

i=l i=k + l 

n—k n 

Y (A - B) < Y (A) + V 

i=l i=k+l 

n — k n 

ct?(A - B) + ^ cr,^(A) < Y ^H-^) + V 

i=2 i=k + l 

n n — k 

a?(A-B)< Y (A) - Y, (A) + rj 

i=k + l i=2 

af{A - B) < o-fc+ijA) + rj. 

CTfe+ijA) is equal to the squared top singular value of A — Ak (i.e. || A — AfcHl, so the lemma follows. □ 
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